Positions and intervals provide addresses for anything that can be mapped to a reference genome:
It is no wonder that positions and intervals are key fields in most genomic data files.
For example, next-generation sequencing (NGS) technologies can map all kinds of functional biomolecular information along this coordinate system.
Show a grey line - genome - with some genes, orange intervals - like on the figure below Then show another track - let’s say H3K4me3 (just say that that you can do experiment where you can measure gene activity or something like that) And then pose a question - we would like to find a subset of active genes in this cell line or whatever Therefore - OVERLAP ! Then you say that this is the most basic/fundamental operation in genomics
Genomic intervals are one of the most prevalent data structures in computational genome biology.
Interval arithmetic provides a language for asking questions about relationships between features.
There are excellent and widely used command-line tools, such as BEDTools for genome interval arithmetic on common text file formats (e.g. BED).
There is also a Python interface pyBedTools. However, it roundtrips data into and out of the command line program through text file intermediates, which is inefficient and places unnecessary limitations.
We sought to create a flexible, Pythonic implementation built entirely with numpy and pandas, with an API smoothly integrated with the Python data ecosystem.
In our implementation, genomic interval sets are simply dataframes.
No need to introduce higher-level data structures.
The core objects in bioframe are pandas DataFrames of genomic intervals, or BedFrames defined by chrom, start and end columns (the names are configurable). We don't use opaque wrapper objects or DataFrame subclasses.
import bioframe as bf
df1 = pd.DataFrame([
['chr1', 1, 5],
['chr1', 3, 8],
['chr1', 8, 10],
['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)
display(df1)
bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
plt.title('bedFrame1 intervals');
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 1 | 5 |
| 1 | chr1 | 3 | 8 |
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
df2 = bf.from_any(
[['chr1', 4, 8],
['chr1', 10, 11]],
name_col='chrom'
)
display(df2)
bf.vis.plot_intervals(df2, show_coords=True, xlim=(0,16), colors='lightpink')
plt.title('bedFrame2 intervals');
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 4 | 8 |
| 1 | chr1 | 10 | 11 |
The most common operation is to calculate the overlaps between two sets of genomic intervals.
overlaps = bf.overlap(df1, df2, how='inner', suffixes=('_1','_2'))
overlaps
| chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | |
|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 4 | 8 |
| 1 | chr1 | 3 | 8 | chr1 | 4 | 8 |
df1 = pd.DataFrame([
['chr1', 1, 5],
['chr1', 3, 8],
['chr1', 8, 10],
['chr1', 12, 14],
],
columns=['chrom', 'start', 'end']
)
bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
df_annotated = bf.cluster(df1, min_dist=0)
display(df_annotated)
bf.vis.plot_intervals(df_annotated, labels=df_annotated['cluster'], show_coords=True, xlim=(0,16))
| chrom | start | end | cluster | cluster_start | cluster_end | |
|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 0 | 1 | 10 |
| 1 | chr1 | 3 | 8 | 0 | 1 | 10 |
| 2 | chr1 | 8 | 10 | 0 | 1 | 10 |
| 3 | chr1 | 12 | 14 | 1 | 12 | 14 |
Instead of returning cluster assignments,bioframe.merge returns a new dataframe of merged genomic intervals. As with bioframe.cluster, using min_dist=0 and min_dist=None gives different results.
If min_dist=0, this returns a dataframe of two intervals:
df_merged = bf.merge(df1, min_dist=0)
display(df_merged)
bf.vis.plot_intervals(df_merged, show_coords=True, xlim=(0,16))
| chrom | start | end | n_intervals | |
|---|---|---|---|---|
| 0 | chr1 | 1 | 10 | 3 |
| 1 | chr1 | 12 | 14 | 1 |
If min_dist=None, this returns a dataframe of three intervals:
df_merged = bf.merge(df1, min_dist=None)
display(df_merged)
bf.vis.plot_intervals(df_merged, show_coords=True, xlim=(0,16))
| chrom | start | end | n_intervals | |
|---|---|---|---|---|
| 0 | chr1 | 1 | 8 | 2 |
| 1 | chr1 | 8 | 10 | 1 |
| 2 | chr1 | 12 | 14 | 1 |
It is often useful not only to find features that overlap, but also features that are nearby along the genome.
closest_intervals = bf.closest(df1, df2, suffixes=('_1','_2'))
display(closest_intervals)
| chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | distance | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 4 | 8 | 0 |
| 1 | chr1 | 3 | 8 | chr1 | 4 | 8 | 0 |
| 2 | chr1 | 8 | 10 | chr1 | 4 | 8 | 0 |
| 3 | chr1 | 12 | 14 | chr1 | 10 | 11 | 1 |
By default, bioframe.closest reports overlapping intervals. This can be modified by passing ignore_overlap=True. Note the closest pair #2 and #3, which did not overlap, remain the same.
Closest intervals within a single DataFrame can be found simply by passing a single dataframe to bioframe.closest. The number of closest intervals to report per query interval can be adjusted with k.
bf.closest(df1, k=2)
| chrom | start | end | chrom_ | start_ | end_ | distance | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 3 | 8 | 0 |
| 1 | chr1 | 1 | 5 | chr1 | 8 | 10 | 3 |
| 2 | chr1 | 3 | 8 | chr1 | 1 | 5 | 0 |
| 3 | chr1 | 3 | 8 | chr1 | 8 | 10 | 0 |
| 4 | chr1 | 8 | 10 | chr1 | 3 | 8 | 0 |
| 5 | chr1 | 8 | 10 | chr1 | 12 | 14 | 2 |
| 6 | chr1 | 12 | 14 | chr1 | 8 | 10 | 2 |
| 7 | chr1 | 12 | 14 | chr1 | 3 | 8 | 4 |
Bioframe has two functions for computing differences between sets of intervals: at the level of basepairs and at the level of whole intervals.
Basepair-level subtraction is performed with bioframe.subtract:
subtracted_intervals = bf.subtract(df1, df2)
display(subtracted_intervals)
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 1 | 4 |
| 1 | chr1 | 3 | 4 |
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
bf.vis.plot_intervals(subtracted_intervals, show_coords=True, xlim=(0,16))
Interval-level differences are calculated with bioframe.setdiff:
setdiff_intervals = bf.setdiff(df1, df2)
display(setdiff_intervals)
| chrom | start | end | |
|---|---|---|---|
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
bf.vis.plot_intervals(setdiff_intervals, show_coords=True, xlim=(0,16))
For two sets of genomic features, it is often useful to calculate the number of basepairs covered and the number of overlapping intervals. While these are fairly straightforward to compute from the output of bioframe.overlap with pandas.groupby and column renaming, since these are very frequently used, they are provided as core bioframe functions.
df1_coverage = bf.coverage(df1, df2)
display(df1_coverage)
| chrom | start | end | coverage | |
|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 1 |
| 1 | chr1 | 3 | 8 | 4 |
| 2 | chr1 | 8 | 10 | 0 |
| 3 | chr1 | 12 | 14 | 0 |
df1_coverage_and_count = bf.count_overlaps(df1_coverage, df2)
display(df1_coverage_and_count)
| chrom | start | end | coverage | count | |
|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 1 | 1 |
| 1 | chr1 | 3 | 8 | 4 | 1 |
| 2 | chr1 | 8 | 10 | 0 | 0 |
| 3 | chr1 | 12 | 14 | 0 | 0 |
Equally important to finding which genomic features overlap is finding those that do not. bioframe.complement returns a BedFrame of intervals not covered by any intervals in an input BedFrame.
Complments are defined relative to a genomic view. Here this is provided as a dictionary with a single chromosome of length 15:
df_complemented = bf.complement(df1, view_df={'chr1':15})
display(df_complemented)
| chrom | start | end | view_region | |
|---|---|---|---|---|
| 0 | chr1 | 0 | 1 | chr1 |
| 1 | chr1 | 10 | 12 | chr1 |
| 2 | chr1 | 14 | 15 | chr1 |
bf.vis.plot_intervals(df_complemented, show_coords=True, xlim=(0,16), colors='lightpink')
If no view is provided, complement is calculated per unique chromosome in the input with right limits of np.iinfo(np.int64).max.
df_complemented = bf.complement(df1)
display(df_complemented)
| chrom | start | end | view_region | |
|---|---|---|---|---|
| 0 | chr1 | 0 | 1 | chr1 |
| 1 | chr1 | 10 | 12 | chr1 |
| 2 | chr1 | 14 | 9223372036854775807 | chr1 |
view_df = pd.DataFrame(
[
["chr1", 0, 4, "chr1p"],
["chr1", 5, 9, "chr1q"],
["chrX", 0, 50, "chrX"],
["chrM", 0, 10, "chrM"]],
columns=["chrom", "start", "end", "name"],
)
df_unsorted = pd.DataFrame([
['chrM', 3, 8],
['chrM', 1, 5],
['chrX', 12, 14],
['chrX', 8, 10]],
columns=['chrom', 'start', 'end']
)
bf.sort_bedframe(df_unsorted)
| chrom | start | end | |
|---|---|---|---|
| 0 | chrM | 1 | 5 |
| 1 | chrM | 3 | 8 |
| 2 | chrX | 8 | 10 |
| 3 | chrX | 12 | 14 |
bf.sort_bedframe(df_unsorted, view_df)
| chrom | start | end | |
|---|---|---|---|
| 0 | chrX | 8 | 10 |
| 1 | chrX | 12 | 14 |
| 2 | chrM | 1 | 5 |
| 3 | chrM | 3 | 8 |
bioframe.select(df_unsorted,'chrX:8-14')
| chrom | start | end | |
|---|---|---|---|
| 2 | chrX | 12 | 14 |
| 3 | chrX | 8 | 10 |
#import hg
#hg.Viewconf.from_url("https://resgen.io/api/v1/viewconfs/Y_omIrpERgG01VsqmtMLVA/?raw=1")
Figure from paper
ctcf_peaks = bioframe.read_table(
"https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
schema='narrowPeak'
)
ctcf_peaks
| chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr19 | 48309541 | 48309911 | . | 1000 | . | 5.04924 | -1.0 | 0.00438 | 185 |
| 1 | chr4 | 130563716 | 130564086 | . | 993 | . | 5.05052 | -1.0 | 0.00432 | 185 |
| 2 | chr1 | 200622507 | 200622877 | . | 591 | . | 5.05489 | -1.0 | 0.00400 | 185 |
| 3 | chr5 | 112848447 | 112848817 | . | 869 | . | 5.05841 | -1.0 | 0.00441 | 185 |
| 4 | chr1 | 145960616 | 145960986 | . | 575 | . | 5.05955 | -1.0 | 0.00439 | 185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40582 | chr8 | 22574315 | 22574744 | . | 1000 | . | 561.11939 | -1.0 | 4.90268 | 243 |
| 40583 | chr15 | 56246029 | 56246402 | . | 1000 | . | 569.05663 | -1.0 | 4.90268 | 191 |
| 40584 | chr1 | 150979463 | 150979845 | . | 1000 | . | 580.28482 | -1.0 | 4.90268 | 194 |
| 40585 | chr16 | 57649040 | 57649402 | . | 1000 | . | 602.95266 | -1.0 | 4.90268 | 173 |
| 40586 | chr12 | 54379625 | 54380042 | . | 1000 | . | 627.60723 | -1.0 | 4.90268 | 203 |
40587 rows × 10 columns
jaspar_url = 'http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/'
jaspar_motif_file = 'MA0139.1.tsv.gz'
ctcf_motifs = bioframe.read_table(
jaspar_url + jaspar_motif_file,
schema='jaspar',
skiprows=1
)
ctcf_motifs
| chrom | start | end | name | score | pval | strand | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 11163 | 11182 | CTCF | 811 | 406 | - |
| 1 | chr1 | 11222 | 11241 | CTCF | 959 | 804 | - |
| 2 | chr1 | 11280 | 11299 | CTCF | 939 | 728 | - |
| 3 | chr1 | 11339 | 11358 | CTCF | 837 | 455 | - |
| 4 | chr1 | 11401 | 11420 | CTCF | 829 | 439 | - |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 770044 | chrY_KI270740v1_random | 4600 | 4619 | CTCF | 815 | 412 | - |
| 770045 | chrY_KI270740v1_random | 5601 | 5620 | CTCF | 815 | 412 | - |
| 770046 | chrY_KI270740v1_random | 7601 | 7620 | CTCF | 815 | 412 | - |
| 770047 | chrY_KI270740v1_random | 8602 | 8621 | CTCF | 815 | 412 | - |
| 770048 | chrY_KI270740v1_random | 19753 | 19772 | CTCF | 806 | 395 | - |
770049 rows × 7 columns
df = bioframe.overlap(
ctcf_peaks,
ctcf_motifs,
# suffixes=('_1', ''),
return_index=True,
how='left',
)
motifs_per_peak = df.groupby(["index"])["index_"].count().values
bins = np.arange(0, np.max(motifs_per_peak))
counts, _ = np.histogram(motifs_per_peak, bins)
plt.bar(bins[:-1], height=counts, align='center')
plt.xlabel('number of overlapping motifs per peak')
plt.ylabel('number of peaks')
plt.semilogy();
print(f'fraction of peaks without motifs {np.round(np.sum(motifs_per_peak==0)/len(motifs_per_peak),2)}');
fraction of peaks without motifs 0.14
df
| index | chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit | index_ | chrom_ | start_ | end_ | name_ | score_ | pval_ | strand_ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | chr19 | 48309541 | 48309911 | . | 1000.0 | . | 5.04924 | -1.0 | 0.00438 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 1 | 1 | chr4 | 130563716 | 130564086 | . | 993.0 | . | 5.05052 | -1.0 | 0.00432 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 2 | 2 | chr1 | 200622507 | 200622877 | . | 591.0 | . | 5.05489 | -1.0 | 0.00400 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 3 | 3 | chr5 | 112848447 | 112848817 | . | 869.0 | . | 5.05841 | -1.0 | 0.00441 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 4 | 4 | chr1 | 145960616 | 145960986 | . | 575.0 | . | 5.05955 | -1.0 | 0.00439 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48626 | 40584 | chr1 | 150979463 | 150979845 | . | 1000.0 | . | 580.28482 | -1.0 | 4.90268 | 194.0 | 40407 | chr1 | 150979668 | 150979687 | CTCF | 925.0 | 678.0 | - |
| 48627 | 40585 | chr16 | 57649040 | 57649402 | . | 1000.0 | . | 602.95266 | -1.0 | 4.90268 | 173.0 | 261215 | chr16 | 57649185 | 57649204 | CTCF | 918.0 | 656.0 | - |
| 48628 | 40585 | chr16 | 57649040 | 57649402 | . | 1000.0 | . | 602.95266 | -1.0 | 4.90268 | 173.0 | 261216 | chr16 | 57649229 | 57649248 | CTCF | 829.0 | 439.0 | + |
| 48629 | 40586 | chr12 | 54379625 | 54380042 | . | 1000.0 | . | 627.60723 | -1.0 | 4.90268 | 203.0 | 153951 | chr12 | 54379782 | 54379801 | CTCF | 863.0 | 511.0 | - |
| 48630 | 40586 | chr12 | 54379625 | 54380042 | . | 1000.0 | . | 627.60723 | -1.0 | 4.90268 | 203.0 | 153952 | chr12 | 54379835 | 54379854 | CTCF | 885.0 | 564.0 | - |
48631 rows × 19 columns
%%bash
echo "Hello World"
which sed
Hello World /bin/sed
import hg
config = hg.Viewconf.from_url('https://higlass.io/api/v1/viewconf?d=default')
config